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The mutation and selection of regulatory DNA sequences is presented as an ideal model system of 
molecular evolution where genotype, phenotype, and fitness can be explicitly and independently 
characterized. In this theoretical study, we construct an explicit model for the evolution of regula- 
tory sequences, making use of the known biophysics of the binding of regulatory proteins to DNA 
sequences, under the assumption that fitness of a sequence depends only on its binding affinity to 
the regulatory protein. The model is confined to the mean field (i.e., infinite population size) limit. 
Using realistic values for all parameters, we determine the minimum fitness advantage needed to 
maintain a binding sequence, demonstrating explicitly the "error threshold" below which a bind- 
ing sequence cannot survive the accumulated effect of mutation over long time. The commonly 
observed "fuzziness" in binding motifs arises naturally as a consequence of the balance between 
selection and mutation in our model. In addition, we devise a simple model for the evolution of 
multiple binding sequences in a given regulatory region. We find the number of evolutionarily 
stable binding sequences to increase in a step-like fashion with increasing fitness advantage, if 
multiple regulatory proteins can synergistically enhance gene transcription. We discuss possible 
experimental approaches to resolve open questions raised by our study. 
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I. INTRODUCTION 

The regulation of gene expression involves many differ- 
ent proteins known as transcription factors which bind 
passively to specific sites on t he genomic DNA (see, e.g. 
iGerhart and Kirschnerl <)l997j) ). In bacteria, each such 
site (called 'operator') typically consists of a contigu- 
ous sequence of 20 — 30 nucleotides which binds a spe- 
cific transcription factor with much higher affinity than 
would a random DNA sequence of compa rable length 
l)Stormo and Fieldsl Il993 Ivon Hippeill979|) . Known ex- 
amples of different operators for the same factor usually 
differ from the maximum affinity binding sequence in a 
number of positions, typically in as many as 20% to 30% 
of the significant positions that contribute most to the 
specificity of the interaction. The ensemble of viable 
binding sequences is collectively referred to as the bind- 
ing "motif" for a factor; its "fuzziness" creates a difficult 
computational problem for the pr ediction of binding site s 
via informatic methods (see , e.g.. lLawrence et~ 1 (119931k 
IStormo and Hartzelll (fl989) : and references therein). In 
many known single regulatory region contains 

multiple operators for the same factor, each of which de- 
viates from the maximum affinity binding sequence. 

Why are the motifs fuzzy? One possible scenario is 
that the binding affinity of each operator is tuned evo- 
lutionarily to maximize the function of each regulated 
gene or operon. An alternative scenario is that the func- 
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tion is insensitive to the detail of the binding affinity 
as long as it is above some threshold. In the former 
case, fuzziness in the binding arises due to the partic- 
ular distribution of functional requirment. In the latter 
case, binding sequences in different regulatory regions are 
deemed "equal" , and fuzziness results from maximizing 
the sequence "entropy". While anecdotal examples of 
both cases are known, understanding whether either case 
dominates in biology is of interest not only for its own 
sake, but also very important for the choice of proper 
informatics tools for motif finding. In this paper, we de- 
scribe a detailed theoretical study of the latter case from 
an evolutionary perspective, recognizing that as with any 
other portion of the genome, the binding sequences are 
subject to the opposing forces of mutation and selection 
over evolutionary timescales. In particular, we address 
the quantitative question of how large a selective advan- 
tage the presence of a binding motif needs to provide, 
in order to guarantee its survival against mutations, and 
how large an advantage before multiple motifs are justi- 
fied. To make the study concrete and explicit, we will 
mostly confine the discussion to gene regulation in bac- 
teria or phages, and focus on the binding of one specific 
transcription factor to its operator(s) in the regulatory 
region of one specific gene or operon. We will not treat 
the interactions among different factors, since in bacteria 
such as E. co li, the majority of genes are regu lated by a 
single factor ijGralla and Collado-Videslll996j) . 

Another motivation for our study is that the evolu- 
tion of transcription factor binding motifs seems to be 
a well-suited starting point for an attempt to establish 
a link between the microscopic molecular mechanisms in 
the cell and the "macroscopic" principles of evolution: In 
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general, the most important ingredient in an evolution- 
ary study is to relate the genotype on which mutation 
acts to the fitness of the organism on which selection 
acts through some quantifiable phenotype. This relation 
is particularly simple for the operator binding problem 
at hand, where a natural choice of the phenotype is the 
binding probability of the transcription factor to the op- 
erator. Regardless of whether the factor acts as an acti- 
vator by attracting a polymerase to transcribe the gene, 
or as a repressor to block transcription, it can function 
only when it is bound to its operator. The fraction of 
time an operator is occupied in equilibrium is given by 
the binding probability. To regulate the transcription of 
the gene, e.g., in reaction to a change in the environmen- 
tal conditions or in order to trigger a different phase of 
the cell cycle, the cell changes the factor-operator bind- 
ing probability by varying the concentration of the (ac- 
tivated) factor inside the cell. The concentration may 
vary from practically zero in the "OFF-state" to typi- 
cally several hundred copies per cell in the "ON-state". 
We will make the reasonable (but critical) assumption 
that the fitness gain an operator contributes to the or- 
ganism depends solely on the binding probability P in 
the ON-state, with the value of P itself determined by 
the actual sequence of the operator through the binding 
energetics. 

For a few exemplary transcription factors, the 
variation in binding affinity upon mutation of the 
binding sequenc e has been studied in great detail 
experimentally dFields et all Il997t lOda et all Il998t 
ISarai and Takedal ll989tlTakeda et all Il98fll> . In partic- 
ular, Fields, Stormo, and coworkers have shown for the 
case of the mnt repressor that its binding (free) energy 
is approximately a sum over independent contributions 
from ea ch of the nucleotid e positions in the binding se- 
quence (|Fields et allll997|) . Typically, only 10 - 15 posi- 
tions in a binding site have a strong preference for specific 
nucleotides, while the other positions do not contribute 
significantly to the binding energy. Known binding se- 
quences display a fuzziness of up to 3 ~ 4 mismatches 
in these significant positions. A useful simplified 'two- 
state model' for transcription factor binding is obtained 
by taking only the significant bases into account and as- 
signing to each of them the same binding energy e, i.e. 
a match (to the optimal binding sequence) is favored by 
an energy difference e over a mismatch. This mo del, in- 
troduced long ago bv lvon Hinnel and Berd l|l98fif ). takes 
into account the effect of sequence-specific binding by a 
single parameter e, without reference to detailed bind- 
ing energies which have not yet been measured for most 
transcription factors. 

Based on the two-state model and our assumption on 
the contribution of the binding of the factor towards fit- 
ness, we construct an explicit theory for the evolution 
of the binding sequences. Within the mean-field ap- 
proach originally prop osed by Eigen in the context o f 
quasispecies evolution (jEigent Il97lt lEigen et all fl989). 
we characterize the balance between the opposing forces 



of selection and mutation quantitatively. We determine 
the critical selection pressure needed to keep a motif from 
mutating away, and show how the fuzziness in the motifs 
arises naturally above the selection threshold. We further 
apply the theory to investigate the frequently observed 
occurrence of multiple motifs in a given regulatory region, 
and elaborate on various plausible causes. Towards the 
end, we will provide extended discussions on experimen- 
tal approaches to pursue the open questions suggested by 
this study. 

II. MODEL AND EQUATIONS 

We focus on the operator sequence located in the reg- 
ulatory region of a gene of interest. By assumption, this 
gene is regulated by a single transcription factor. Let 
S = {Si, S2, Sl} denote the L significant nucleotides 
of the operator which specify transcription factor bind- 
ing. We will keep the alphabet size, A, as a variable 
in our equations, since, as we will see below, this facili- 
tates the intuitive understanding of certain dependencies; 
however A = 4 and Si G {A, C, G, T} is the only case of 
interest here. To describe the evolution of S in a popula- 
tion of bacteria or phages, we need to specify the action 
of selection and mutation. 



Selection mechanism 

It should be clear that gene regulation is only needed 
in the presence of a changing cellular state, triggered ei- 
ther internally, e.g. cell cycle, or externally through a 
change in the environment. Hence to study the fitness 
of a regulatory mechanism, we must invoke at least two 
different states. Selection arises when the growth rate 
of an organism depends on the probability Pg that the 

factor binds to the sequence S in the state that prefers 
factor binding (the "ON-state"). For the sake of con- 
creteness, let us consider an environment that oscillates 
between two states. We assume that in State A (the ON- 
state), the environment induces a certain concentration 
of activated factors, say on average Ntf factors per cell 
(these may either be produced upon entering State A, or 
preexisting factors may be activated for binding by induc- 
ers that cause an allosteric transition). Let the growth 
rate or "fitness" of the organism in this state be 4>a if 
the factor is never bound (binding motif not present), 
and 4>a + $4>A > i>A if the factor is always bound (see 
Table 1). When the environment is in State B (the OFF- 
state), let the fitness be if the factor is never bound 
and <Pb — StpB < <^b if the factor is always bound. In 
the following we will assume that the concentration of 
activated factors in State B is practically zero, so that 
the operator is never occupied (hence the parameter 5<j>B 
does not enter our model). 

An example for the general situation described above 
could be the binding of the /ac-repressor to its operator 
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State A 


State B 


factor unbound 






factor bound 


4>A + &4>A 


4>B — &4>B 



TABLE I Fitness of the organism in two different cellular 
states, with or without the binding of the transcription factor. 



in the Zac-operon of E. coli. In this case, the ON-state 
would be the glucose-rich environment, and the OFF- 
state would be the glucose-poor and lactose-rich envi- 
ronment. 4>a,b would be the growth rate of E. coli in 
the two environments in the absence of the /ac-repressor. 
8<pA would be the increment in fitness when the unnec- 
essary Zac-operon is turned off, and 4>b — 5<t>B ~ is the 
deleterious situation when lactose is present as the main 
source of sugar but the Zac-operon is not operative due 
to the undesirable binding of the repressor. 

In this study, we will discuss mainly the time-averaged 
effect over evolutionary time scales, which are much 
larger than the time scales of cellular or environmental 
fluctuations. We choose r/ln2 as our unit of time, with 
t denoting the average generation time in the absence of 
the factor, so that the time averaged growth rate there 
can be set to 1. We assume that the cell can quickly ad- 
just the cellular concentration of the factor 1 so that the 
operator with sequence S is occupied with probability Pg 
in the ON-state and never occupied in the OFF-state. It 
is then plausible to assume that the time averaged growth 
rate depends linearly on Pg (see also the discussion 
in the section entitled 'Selection threshold and fuzzy mo- 
tifs'), 
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a-Pg 



(1) 



Here, a is a dimensionless parameter which characterizes 
the selection pressure on the binding sequence S. In the 
limit a <C 1, there is hardly any selection pressure on the 
sequence at all; the opposite limit a — > oo corresponds to 
the case where the failure of the factor-operator binding 
is lethal to the organism. If the fraction of time the 
bacteria population encounters environment A is /, the 
selection pressure becomes 



a = f ■ 



(2) 



1 In the present article, we do not consider the 'search prob- 
lem' of how a transcription factor locates its operator among 
millions of other sites on the DNA (see 
IWinter and von Hippeil Il98ll) : IWinter et~ 



Berg et alj 11981] 
19811) for a thor- 
ough experimental and theoretical investigation of this problem, 
and Gerland, Moroz, and Hwa (2001, submitted for publication) 
for a bound on the protein-DNA interaction parameters that re- 
sults from the requirement of reasonable search times). Rather 
we treat protein-DNA binding as an equilibrium process charac- 
terized only by the binding probability. This is justified by the 
fact that the search time is typically on the order of 1 min, which 
is much smaller than the characteristic time scale of changes in 
gene expression. 



In an experiment, a can be adjusted according to Eq. @ 
by changing the fraction of time / the ON-state is pre- 
sented. Below, we will investigate the statistics of the 
selected sequence S for a wide range of a's. 



Mutation process 

We consider only single-nucleotide substitutions, and 
focus on mutations in the binding sequence S, assuming 
that the net result of mutation and selection on the rest 
of the genome gives the overall background fitness of 1 
(with our time unit of r/ln2). Furthermore, we neglect 
the difference between transversions and transitions, and 
assume a constant rate vq at which a base mutates into 
any other base. The total mutation rate of a site of length 
L is then v = v§ L. For bacteria such as E. Coli, vq is 
on the order of 10 -9 under normal conditions and hence 
v ~ 1CP 8 . The mutation rate is larger for RNA viruses 
which rely on the less accurate reverse transcriptase for 
replication. For that case, v$ is in the range 10~ 5 to 10~ 4 
and hence v ~ 10~ 4 — 10 -3 . 



Binding probability 

As mentioned above, the binding (free) energy Eg of 
the transcription factor to the binding sequence S is 
given, to a good approximation, by a sum over inde- 
pendent contribut ions from each nucleotide at the L sig- 
nificant positions ({Fields et al.lll99'?1:IStormo and Fieldsl 

Il998|) . i.e. Eg = Ef=i Each of these Po- 

sitions typically prefers a particular nucleotide by a 
binding energy of several fc^T's (we exclude from S 
those positions which contribute only a fraction of 
ksT towards the total binding e nergy) . Furthermore 
we adopt the 'two-state model' l)Berg and von Hippeil . 
ll987tlvon Hippel and BereLll986|) . by assuming that each 
Ei(Si) can only take on two possible values, if Si 
matches the preferred base S*, or e > for a mismatch, 
i.e. Si(Si) = e ■ (1 — 8s u s*)- The binding energy of a site 
S is then only a function of the number of mismatches, 
or Hamming distance rg = \ S — S*\, from the optimal 
binding sequence S* , i.e. 2 , 



E s = E ( r s) = r s £ 



(3) 



Given its binding energy, the average occupancy of a site 



2 Note that the approximate linear relationship between the bind- 
ing energy and the number of mismatches, Eq. breaks 
down when E reache s a certain non-specific binding energy E n3 
IWinter et allll98lD . Beyond this point, the binding energy re- 
mains constant. However, the expression nevertheless pro- 
vides an useful description of the binding probability over the 
entire range < r < L, since P(r) is essentially zero when 

E = E ns ■ 
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is determined by equilibrium thermodynamics. Since a 
binding site can only be occupied or unoccupied (but not 
multiply occupied), its binding probability Pg = P(rg) 
is given by a Fermi function, 



P(r) 



1 + e e ( r_r o)/ fciE 



(4) 



which is also known as Arrhenius function, see, e.g., 
lAtkind l)l998|h Here, /x = ero is the chemical poten- 
tial for the transcription factors in the ON-state (this 
function is plotted in Fig. ^ (right) with realistic param- 
eter values). Note that ro corresponds to the number of 
mismatches for which the probability of binding is 50%. 

In total, we are left with three dimensionless pa- 
rameters for the two-state model of protein-DNA bind- 
ing: L, e/ksT, and ro. As mentioned before, the 
number of significant positions in a binding site is 
typically in the range 10 < L < 15. By inspec- 
tion of the known binding ene r gies for exemplary tran- 
scription factors (Fields et^alj^ 119971 lOda et all Il998t 
ISarai and Takedalfl989HTakeda et al.Lll989|) , we find the 
mean specificity of the significant sites to be typically 
e = 1 ~ 3k B T. In (Gerland, Moroz, and Hwa, 2001, 
submitted for publication), it is argued on rather gen- 
eral ground that this is actually the optimal range of e 
for the transcription factors. The chemical potential fi 
depends directly on the average number of factors A/tf 
in the cell; the work of Gerland, Moroz, and Hwa (2001, 
submitted for publication) suggests that it can be ap- 
proximated by (X w fx + fc s Tln[iVTp], where fXo repre- 
sents the binding free energy of a single factor to the rest 
of the genome. For those factors whose binding energies 
Ei(Si) have been measured, we find /io « —0.8ksT (mnt) 
and /io ~ — 1.9fcsT (lambda-repressor and cro). Hence, 
H w k B Tln[N TF }; see Fig. □ for details. For e = 2k B T 
and N TF = 50 ~ 5000, we get r = fi/e = 2 - 4.3. 
Clearly, ro is the parameter that we have the least in- 
formation about; but we see that it has a limited range, 
and in any case, most of our qualitative conclusions will 
be insensitive to the specific choice of ro. [Note that 
the above analysis is for factors that have a binding 
site only in a single regulatory region. For those fac- 
tors which are global regulators and have many operators 
located throughout the genome (e.g., the factor CRP in 
E. coli) , the number ]Vtf above needs to be appropriately 
adjus ted by the number of operators (|Sengupta et all 
200.3).] 
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FIG. 1 Left: chemical potential /i (in units of ksT ~ 
0.6kcal/mol) as a function of the number of factors A^tf 
in the cell. The solid lines indicate the result of a numer- 
ical c alculation using the Mnt energy matrix iFields et all 
11997ft on the Salm onella genome, the cro energy matrix 
(iTakeda et all Il 989ft. and the l ambda repressor (cl) energy 
matrix JSarai and Takedalli~989T) on the E. coli genome. Here, 
we assumed that all of the factors in the cell are bound to 
DNA. The dashed line delineates the slope ksT which would 
be expected from the considerations of Gerland, Moroz, and 
Hwa (2001, submitted). Right: binding probability versus 
the number of mismatches from the best binding sequence, 
according to Eq. with realistic parameters. 



Ng(t). The time evolution of Ng(t) is described by 



d 



vN s (t) + $ s N s (t) 



(•5) 



The first term on the right hand side describes the mu- 
tational flow into Ng from all sequences S' that are a 
single-nucleotide mutation away, while the second de- 
scribes the reverse process. The third term represents the 
(time-averaged) selection/amplification process. Eq. (JjJJ 
is similar to the "para-muse model" considered in a dif- 
ferent context bv lBaake etHI l)l997|) . 

Since the fitness function JQ) depends on the sequence 
S only through the binding probability Pg, which de- 
pends only on the number of mismatches r according to 
Eq. it is advantageous to introduce a 'radial distri- 
bution' N(r,t) in the (discre te) Hamming distance space 
llNowak and SnhnstfiilllflSflh 



(6) 



Evolution equation 

In this study, we will focus on the steady-state proper- 
ties of the mutation/selection process defined above. For 
large population size and close to the steady-state, we 
may consider only the dynamics of the average popula- 
tion and neglect fluctuations due to the discreteness of 
the individual organisms. We denote the average num- 
ber of individuals at time t with binding sequence S by 



With <p(rg) = &g denoting the 'radial fitness' function, 
the evolution equation for N(r, t) becomes 



_5 



N(r,t) = (f>(r)N(r,t) 



V{) 



■A, 



A-l 

is Q A r (L-r)N{r,t) 



(r + l)N(r + l,t) 



(7) 



where A r [/(r)] = f(r) — f(r — 1) denotes the discrete 
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derivative, and 

0(r) = 1 + aP(r) (8) 

is a mesa-shaped fitness landscape. Eq. (JJJ is obtained by 
observing that there are (L — r){A— 1) ways to mutate a 
site with r mismatches into a site with r+1 mismatches, 
r ways to mutate it into a site with r — 1 mismatches, 
and r(.A — 2) ways to mutate a site without changing the 
number of mismatches. 

We will characterize the predictions of our model 
by numerically integrating the discrete radial evolution 
equation (JJJ) using the set of realistic parameters given 
above. However, to gain insight about the qualitative 
behavior of the model, we also analyze the continuum- 
space evolution equation obtained in the limit of large 
L, 



d , , d 
0t n(r,t) = d~r 



D(r)-^n(r,t) 



v(r) n(r, t) 



+ ^{r)n{r,t), (9) 

where we used n{r, t) and Tp(r) to denote the continuum 
generalization of the functions N(r, t) and cf>(r) respec- 
tively. Note that the mutational dynamics is locally con- 
servative, with a local current j(r,t) = D{r) d r n(r,t) — 
v(r)n(r,t). The appropriate boundary conditions are 
j(CM) = and jCM) = 0. 

The continuous radial evolution equation reduces 
the evolutionary dynamics to a simple one-dimensional 
drift-diffusion equation, where the 'diffusion coefficient' 
D(r) and the 'drift velocity' v(r) are explicitly given by 

A-2 r' 



D{r) 



v(r) 



5" 



1 



A-l L 
A 



A-l L 



(10) 



(11) 



Note that both D and v are proportional to the over- 
all mutation rate v = isqL, with only weak dependence 
on r for r <C L. The drift velocity drives the distribu- 
tion away from the optimal binding site at r = 0, simply 
because the number of sequences with a fixed number 
of mismatches r increases quickly with r. This purely 
cntropic bias changes sign at (A—1)L/A, which is the 
average number of mismatches in a random binding se- 
quence. Also note that the r-dependence of the diffusion 
coefficient disappears when A — 2, because for a two- 
letter alphabet, every mutation implies a change in r. 
For A > 2, there are mutations which do not change the 
Hamming distance and hence do not affect the diffusion 
process. This effect is reflected in the reduction of D in 
Eq. O- 

Our continuous radial evolution equation JSJ) is some- 
what reminiscent of the evolution equation in fitness 
space introduced by Tsimring, Levine, and Kessler 
i|Tsimring et all Il996|) in a general population genetics 
context. However, with our concrete model for protein- 
DNA binding, we can work directly in genotype space, 
which will enable us to make explicit predictions on the 
behavior of the binding sites. 



III. SELECTION THRESHOLD AND FUZZY MOTIFS 

In this section, we use the evolutionary model Q de- 
scribed in 'Model and Equations' to address the following 
questions: How large a selection pressure is needed for 
the maintenance of binding motifs? And can the fuzzi- 
ness of the motifs be accounted for by the balance be- 
tween mutation and selection? We will first provide an 
analytical solution to a simplified continuum model, and 
then show by numerical simulation that the qualitative 
features of the solution hold even for small system such 
as L = 10. We will compare these results to available 
data and discuss experimental ramifications. 



Analytical results 

Various properties of the continuum model specified 
by Eqs. I|9I11|I can be obtained exactly 3 . Here, we will 
present the results and discuss various qualitative fea- 
tures of the solution, in particular, the existence of a crit- 
ical selection pressure for the maintenance of the binding 
motifs. Even though the continuum model is meaningful 
only for L > ro > 1, we will see from numerical sim- 
ulation that the qualitative features are valid even for 
the more realistic parameter range where ro is not much 
larger than one, and L ~ 10. 

For the analytical study, we neglect the r-dependence 
of the diffusion coefficient (|10|) and the drift velocity (TTJ , 
and use D = v/2,v = v. This is justified as long as tq <c 
L since as we will see, most of the interesting "action" of 
this system occurs around r = vq. Eq. |J3J| then reduces 
to the asymmetric "quantum well" problem well studied 
in the context of various s tatistical mechanics problems 
l|Hatano and NelsonL Il997l) . [It differs from the DNA un- 
zipping problem studied bv lLubenskv and NelsonI (2000) 
only by an (unimportant) boundary condition at r = 0.] 
An explicit solution can be obtained by further approx- 
imating the Fermi function by the Heavyside step 
function 8(r), such that the fitness landscape becomes 



(r) = l+ad{r -r). 



(12) 



This idealized form of the fitness funct ion is known as 
truncation selection ijKondrashovl Il988fl . 

The solution of this simplified continuum problem is of 
the form n{r,t) = no(r)e 7 ', where no(r) is the station- 
ary distribution associated with the largest growth rate 
7. It is controlled by one dimensionless parameter, the 
effective selection pressure 



2a 
v 



(13) 



3 Inspired by the present system, solution of the mean-field evo- 
lution model for general mesa-like fitness landscape has recently 



been developed bv lPelitil [2002). 
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We have 7 = 1 if a is below the critical value 

V 2 

a c = 1 + -f , 



(14) 



where r\ c is of order one and depends only weakly 4 on tq. 
In this regime, n (r) is given by the continuum version 
of the (skewed) binomial distribution 



n(r) = (A- iy 



(15) 



as if the fitness plateau at r < ro is not present. For 
a > a c , the solution is given in terms of the eigenvalue 
problem 



y"(r) + a 6>(r - r) y(r) = A y(r) 



(16) 



with the boundary condition y(0) — y'(0), where y(r) is 
the eigenfunction corresponding to the largest eigenvalue, 
A(5), which must exceed 1. [Thus, the precise definition 
of 5 C is A(5 C ) = 1.] In this regime, the growth rate 
becomes 7 = 1 + (A — V)vj1 > cf> , and the stationary 
distribution is no(r) = y(r)e r . The form of the latter 
can be straightforwardly analyzed for large a's (such that 
A > 1). It is strongly peaked at r* < ro, indicating that 
the motifs are marginally conserved or maximally fuzzy 5 
above the selection threshold. 

A phase transition occurs at a = a c where the sta- 
tionary distribution switches from being mostly confined 
in the region r < ro (localized phase) to the binomial 
distribution (delocalized phase) when A approaches 1. 
This phase transition belongs to the same class of tran- 
sitions as the one descr ibed by Eigen in the context of 
quasi-specie s evolution l)Eigeii Il97ll lEieen et all Il989t 
lHiggslll994|) . The critical selection pressure a c ~ vq ■ L is 
recognized as the well-known form of the "error thresh- 
old" . Note also the dependence of a c on ro as given in 
Eq. (|14fl : a c decreases upon increasing ro, and since A is 
a monotonously increasing function of a — a c , the effec- 
tive growth rate 7 will also increase. This implies that 
a wider fitness landscape has a smaller mutational load 
and a larger effective fitness, whic h is a known result, see, 
e.g. ISchuster and Swetinal l)l988|) . 

The "order parameter" of the phase transition is the 
average number of mismatches in the stationary state, 
(r) = J rno(r)/ J r no{r). In the localized phase, (r) ~ 
ro, while (r) = (A— 1)L/A — > 00 in the delocalized phase. 
When a approaches a c from above, (r) diverges as 



(r) oc (a 



for 



a > a c . 



(17) 



4 For 1 « rj « L, i) r is given to a good approximation by the the 
solution of the equation r\ c = — ro ■ tan(r; c ) and hence r\ c £ [J, 7r] 

5 In the context of protein folding, it has been pointed out 
by R. Goldstein that the balance of mutation and selection 
may lea d to maximal fuz ziness in the space of amino acid se- 
quences iGoldsteinl 12001ft . Our results are similar in spirit, but 
more explicit due to the simplicity of the protein-DNA binding. 
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FIG. 2 The stationary distribution function No(r) for the 
discrete model with L — 10, ro = 3, and e = 2fc_eT. At 
a — 0.5, the distribution is indistinguishable from fi(r). 



indicating that this is a second-order phase transition 6 . 



Numerical results 

In order to test whether the behavior derived above 
for the simplified continuum model holds approximately 
also for the discrete model (J2J) with realistic parameters, 
we performed a number of numerical studies. We deter- 
mined the steady-state distribution iVo(r) of Eq. Q over 
a range of values of a for two sets of parameters: (a) a 
nearly-continuum model, with L = 1000, ro = 30, and 
the step function landscape l|12f> : and (b) the discrete 
model with L = 10, ro = 3, and the Fermi function land- 
scape (JSJ with e = 2ksT. Fig. shows the stationary 
distribution N (r) for the discrete model in the delocal- 
ized regime (a = 0.5), in the localized regime (3 = 3.0), 
and in the crossover 7 region in between (a — 1.4). We see 
that No(r) is peaked slightly below ro = 3 in the localized 
regime, and becomes indistinguishable from the binomial 
distribution (|15fl in the delocalized phase. Note that the 
distribution is broad in the crossover regime, which is 
consistent with the finding of a continuous second-order 



It should be noted that both the critical value ct c and the di- 
vergence of (r) near a c are modified if one explicitly includes 
the time dependence of the fitness landscape. In particular, if 
we take the fitness to be <f>(r, t) oc f(t)P(r) (f(t) = 1 in the ON- 
state and f(t) = in the OFF-state), with a stochastic f(t), then 
the evolution dynamics becomes equivale nt to the class of time- 
depen dent depinning problems studied bv lLubenskv and Nelson! 
12000), with the critical behavior (r) oc (a — ct c )~ 2 . 
With a small system such as L = 10, the critical point of the 
phase transition becomes a crossover region, in which the behav- 
ior of the stationary state changes smoothly. 
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FIG. 3 (a) The order parameters ro/{r) for the nearly- 
continuum model (diamonds) and discrete model (triangles). 
The transition is quite pronounced even for L = 10. The in- 
set shows a magnification of the critical region for the nearly- 
continuum model, which confirms the critical exponent of one 
in Eq. 1171 . (b) The dependence of the threshold a c on ro 
for the discrete model. The result a c ~ 1 is in qualitative 
agreement with the formula 1141 derived for the continuum 
model. 



transition in the continuum model (see the last section). 

To make the comparison more quantitative, we next 
examine the order parameter (r). Fig.|3Ja) shows r /(r) 
plotted as a function of 5, for the discrete model (trian- 
gles) and the nearly-continuum model (diamonds). The 
nearly-continuum model displays a sharp transition at 8 
a c Ri 1.6. The sharp transition becomes a pronounced 
crossover for the discrete model, but still with a relatively 
well-defined threshold a c . The r -dependence of a c is 




The critical value for the nearly-continuum model deviates some- 
what from the analytical result 1141 . This deviation is due to the 
r-dependence of v and D, which we neglected in the analytical 
solution of the continuum model. 



FIG. 4 The peak position r* of the stationary distribution 
no(r) for the Fermi function landscape © with L = 10, ro = 
3, and e = 2k B T. 



plotted in Fig. [Jb) over the relevant interval 1 < ro < 5 
(here, we have defined the threshold a c as the value of 
a where the derivative of r /(r)(5) is maximal). We see 
that it is relatively insensitive to the precise value of ro, 
with a c ~ 1 as given qualitatively by the formula l|14(l . 



Viral and bacterial evolution 

We expect the selection threshold described above to 
be detectable in evolution experiments with RNA viruses. 
The total mutation rate v for the binding site for RNA 
viruses is of the order 10~ 3 ~ 10~ 4 for a binding sequence 
of length L = 10. Assuming that the fitness gain of the 
virus in the ON-state (i.e., the factor Sep a in Eq. (J2J) is 
of the order of 1% ~ 10%, then the effective selection 
pressure a — 2 j '6<f> a I 'v on the viral regulatory sequence 
becomes of the order a c ~ 0(1) if the fractional exposure 
/ to the ON-state is set at a few percent level. By vary- 
ing / over the range of several percents, we expect that 
the phase transition should be observable. Moreover, the 
anomalous dependence (see footnote 4) of the selection 
threshold on the temporal variation f(t) should also be 
observable by applying controlled temporal changes to 
the environment. The stationary distribution No(r) it- 
self can be monitored in principle by sequencing a rea- 
sonable number (say 100) viral regulatory sequences after 
stationarity is reached. 

A very different situation is expected for the evolution 
of bacteria or even DNA viruses. The total mutation rate 
v is of the order 10~ 8 for bacteria and 10 -6 for DNA 
viruses. Consequently, a is expected to be four orders 
of magnitude larger than a c for bacteria and two orders 
larger for DNA viruses. What is the behavior of the 
discrete model at such large values of a ? In Fig. ^ we 
show the position of the peak r* of the distribution Nq(t) 
obtained numerically as a function of 5? on a logarithmic 
scale. For values of a exceeding ~ 140, we find the peak is 
pushed down to r* =0, contrary to the fuzziness depicted 
in Fig. 
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FIG. 5 The stationary distribution No(r) for the infinite- 
mesa landscape (I18H with ro = 3 (histogram). For comparison 
the density of states, i.e. the binomial distribution l|15|l . cut 
off at r — ro, is also shown (dashed line). 



This behavior is obviously an artifact of the specific 
feature of the Fermi function landscape used in JSJ: for 
very large 5's, there is an incentive for the distribution 
to move towards small r's due to the very slight increase 
in the value of P(r) for smaller r's. But it is unrea- 
sonable to expect that the simple relation between the 
binding probability P(r) and the fitness function <p(r) 
assumed in this study to hold down to very small dif- 
ferences in P(r). Aside from various kinetic effects of 
binding and temporal variations of the environment that 
we have neglected, fluctuations due to finite population 
size (e.g., genetic drift) simply do not allow for the pop- 
ulation to resolve the very small differences in fitness due 
to the small differe nces in P(r) ; see the theory of nearly- 
neutral evolution l|OhtaL 11992ft . Thus, <j>(r) should be 
effectively r-independent for small r's. This can be im- 
plemented by replacing 4>{r) by a constant value 4>(r' ) 
when </>(tq — 1) — (j>(r ) is below some resolution limit (set 
by the effective population size of the organism.) 9 For 
low mutation rate (or large 5's), this amounts to replac- 
ing the fitness function by an infinte square well: 



= 



a —> oo if r < ro 
if r > r 



(18) 



The stationary distribution obtained in this case depends 
only on the width of the well ro, and is shown in Fig.JSJfor 
r = 3. Note that it is highly peaked at r as expected. 
Hence the binding sequence is fuzzy even as 5 — > oo. 
However, it is different from simply truncating the bino- 
mial distribution 115(1 for r > ro due to the mutational 
load, i.e., a fraction of the population with r — ro will 



This modification of the fitness function should in principle also 
be applied to the case of RNA virus evolution in the vicinity 
of the phase transition. However, it wouldn't make much of a 
difference there because the distribution would be already peaked 
away from small r. 



FIG. 6 Histogram of mismatches of known CRP (singlet) 
sites from the consensus sequence TGTGA TCACA. 



receive an additional deleterious mutation and not make 
it to the next generation. 



Comparison to known sites 

It is useful to compare the above solution to biological 
data. Unfortunately, polymorphisms in the same bind- 
ing sequence across different strains of a bacteria species 
are not yet readily available. Instead, we assume that 
the different binding sequences (of the same transcrip- 
tion factor) located in different regulatory regions across 
the genome may be viewed as a sample of the station- 
ary binding sequence distribution. This is clearly not 
the case if the selection pressure is small, since close to 
the selection threshold, even small differences in selection 
pressure experienced by the different binding sequences 
will produce different binding sequence distributions; see 
Fig. But this should not be a concern for bacteria 
since 5 ^> 5 C there. An obvious candidate is the binding 
sequences for the the well-known E. coli global regula- 
tor CRP (also known as the catabolite activator protein, 
CAP), which is act ivated under low cellular glucose level 
ijSaier et all ^996). There are over one hundred CRP 
sites in the E. c oli genome. We tak e from the Regu- 
lonDB database llSalgado et all |£000) a list of 28 sites 
which are biologically confirmed binding sites and ap- 
pear only once in a given regulatory region. (The case 
of multiple binding sites is discussed in the following sec- 
tion.) The drawback of using CRP sites is that CRP is 
hardly ever the only regulator in a target regulatory re- 
gion, and interaction with other transcription factors can 
complicate the situation. 

Fig. [f)| shows the histogram of the number of mis- 
matches of these CRP sequences from the consensus se- 
quence TGTGA TCACA. While it peaks at r* = 2 ~ 3 

similar to the corresponding distribution of the infinite- 
mesa model in Fig. [S] it is clear that the distribution of 
the CRP sites is broader. The few outliers at r = 0, 1, 4, 5 
may well be due to direct or subtle interaction with 
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other factors which we have not considered in this sim- 
ple model. The existence of nearly equal peaks at r = 2 
and r = 3 is more perplexing: According to our model, 
the distribution should be peaked at the largest possi- 
ble r. [For the L = 10 sequence, entropy favors r = 3 
over r = 2 by a factor of eight.] One possible cause of 
the discrepancy may be the deviation of the real binding 
energy matrix Ei(S) from the two state model. For in- 
stance, suppose the chemical potential fj, in the ON-state 
is such that r = /i/e is slightly below 3. Then accord- 
ing to the pure two-state model, the maximum number 
of allowed mismatches is 2. However, small deviations in 
the binding energies from e will allow a maximum of 3 
mismatches in a subset of the L positions, thereby pro- 
ducing a distribution peaked at both r = 2 and r = 3 
as shown in Fig. The actual stationary distribution of 
r can be easily computed numerically if the energy ma- 
trix is known. However at present, the authors know of 
no example of a transcription factor whose binding en- 
ergy matrix is measured and a large number of binding 
sequences are available. 

A very different explanation of the data in Fig.[fj]is the 
differential selection of each of the CRP motifs as alluded 
already in "Introduction" . Specifically, one can envision 
a situation where the single "ON-state" assumption we 
adopted is not valid, and instead the cell coordinates a 
graded response to cellular glucose shortage, requiring dif- 
ferent operons to turn ON at different (activated) CRP 
concentrations. [In this case, our assumption that the 
fitness function has a simple linear dependence on 
the binding probability obviously breaks down.] Within 
this scenario, the distribution in Figgis solely a result 
of the functional need of the cell, and its resemblance to 
the statistical distribution of Fig.[5]would be fortuituous. 
Distinguishing between the different plausibilities is im- 
portant and can be done by either sequencing the CRP 
binding sites in a variety of related strains to accumu- 
late statistics on polymorphism for each site, or perform 
site-directed mutagensis to specific binding motifs and 
directly measure the fitness function. In general, one 
may expect to find that the form of the fitness function 
depends on the biological function of the binding site. 
In particular, the form JJ) seems more likely to be ap- 
propriate for the case of transcriptional repressors than 
for activators, since repressors need to have a binding 
probability close to one, in order to efficiently suppress 
transcription from the promoter, which is active in the 
absence of the repressor. In the case of activators, the 
promoter has a very low basal level of transcription and 
even an activator with a relatively low binding probabil- 
ity can lead to a large effect on the transcription level. 



IV. MULTIPLE BINDING SITES 

It is well known that regulatory binding motifs often 
occur in doublets or even higher multiplets. For instance, 
the regulatory regions of the E. coli genes crp, dadA, 



dsdXA, fixA, glpFK, glpTQ, lac, manX, nagE, and tsx 
are some of the many regions that contain two CRP bind- 
ing sequences. Here, we extend our model to account 
for the possibility of multiple sites that bind the same 
protein and regulate the same promoter. We will pur- 
sue the question of whether we can interpret regulatory 
regions with multiplets as being under higher selective 
pressure for factor binding than regulatory regions with 
single binding sites. 

Some factors (e.g., A-repressor) bind cooperatively to 
binding sites, thereby effectively enhancing their DNA 
binding specificity. Cooperative factor binding can play 
an important and i nterestin g role in transcription reg- 
ulation (see, e.g., iPtashne (1992)), however it does 
so only for a fraction of the known multiplets, since 
many factors (such as CRP) have no binding domain 
for an attractive interaction between themselves. In the 
present study we exclude factor-factor interactions and 
explore possible selective advantages of multiple inde- 
pendent binding sites. This approach is similar in spirit 
to studies of gene duplication, which consider the evo- 
lution o f multiple copies of the same gene (see, e.g., 
IWagnerl <l2000m . One scenario could be that several 
bound transcription factors can simultaneously interact 
with polymerase to synergistically recruit (or repel) it 
more efficiently than a single factor would. For the 
case of CRP, this effect has been observed, and stud- 
ied in detail, experimentally l l Busbv and Ebrightl Il999t 
lLangdon and Hochschild|. Il999fl . An individual organism 
with a multiplet of binding sites for a factor then has a 
fitness advantage over one with a single binding site, if 
a strong activation (or repression) is beneficial for the 
biological function. Consequently, selection would favor 
multiplets over singlets. On the other hand, random mu- 
tations tend to destroy the binding motifs, so that an 
additional motif is maintained only when its contribu- 
tion to the fitness is sufficiently high. In the following, 
we explore this scenario within our model. 

Let us assume there are two binding sites in a certain 
regulatory region and ask whether they will be main- 
tained by evolution. We begin by constructing a 'two- 
site fitness function' that makes the selection mechanism 
outlined above explicit. As in the previous sections on 
the single-site problem, we assume that the state of the 
bacterium/virus alternates between an ON-state, where 
factor binding leads to a fitness gain, and an OFF-state, 
where factor binding has a negative effect. In the ON- 
state, let the fitness gain due to factor binding be StpAi, 
S(f>A2, or S(pAi2, if a factor is bound to site 1 only, site 
2 only, or both sites, respectively. Using the same argu- 
ments as for the single site case, the time-averaged fitness 
becomes 

$= I + a ■ [P!(l - P 2 ) + a P 2 (l - Pi) + lu P^} , (19) 

where Pi, P 2 denote the probabilities that a factor is 
bound to site 1, 2, which depend on the respective se- 
quences (we neglect the possibility of cooperative bind- 
ing at this point). The selection pressure, a, has again 
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the form Pjl. with a — f 5<j)Ai, while the 'synergism fac- 
tor' uo describes the fractional fitness advantage of two 
bound factors over just one, i.e. u> — 6<f>Ai2/5<j>Ai- I n the 
remaining term, the dimensionless coefficient a consti- 
tutes an 'asymmetry factor' equal to the relative fitness 
gain 5<pA2/S<pAi (i-e. when a ^ 1 one may distinguish 
between a more "important" and a less important site). 
Note that not only the selection pressure, but also u> and 
a may vary between different regulatory regions, even 
when they are controlled by the same factor, since both 
depend on the location of the binding sites with respect 
to the promoter and on the sequence of the promoter 
itself (see the discussion at the end of this section) . 

As in the single site problem, we work in the two-state 
model approximation (see section 'Binding probability'), 
so that the binding probabilities P\ and P 2 only depend 
on the number of mismatches r% and r 2 in the respective 
site, and take the form When the selection pressure 
a is much larger than the mutation rate v (as we typi- 
cally expect in the case of bacterial evolution) , we again 
invoke the argument that very small differences in the fit- 
ness function are hardly resolvable by finite populations, 
and therefore the fitness function should become neutral, 
i.e. r-independent, at small r% and r 2 . This effectively 
amounts to using step functions for the binding proba- 
bilities, i.e. P(r\,2) — 1 for r\ j2 < r$ and P{r\$) = for 
f\.2 > t~q. The two-site fitness function in (ri, r 2 )-space 
is then 



(n,r 2 ) 



1 + a 
1 
1 



if ri <r < r 2 
acr if r 2 < r$ < r\ 
au if n, r 2 < r 
if ri, r 2 > r 



(20) 



In order to simplify the discussion in the following, we 
will use the fitness function I12()[I over the whole range of 
a, since it yields a correct description at large a, and at 
small a, it produces no qualitative changes in the behav- 
ior of the stationary distribution compared to the smooth 
fitness function with P(r) of the form (0}. 

It is straightforward to derive a two-site evolution 
equation analogous to Eq. (7J, which describes the 
approach of the average distribution of mismatches 
N(ri,r2,t) (neglecting fluctuation effects due to finite 
population size) to its stationary state No(ri,r 2 ). One 
obtains 



d_ 

di 



N{n,r 2 ,t) 



(ri,r 2 )N(r 1 ,r 2 ,t) 



+ z/ A ri (n + l)JV(ri + l,r 2 ,i) + 



-(A-l)(L-n)N(r u r 2 ,t) 
(r 2 + l)N( ri ,r 2 + l,t) + 
-(A-l){L-r 2 )N( ri ,r 2 ,t) . (21) 



In the continuum limit, Eq. I|21|) becomes a two- 
dimensional generalization of the (biased) diffusion equa- 
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FIG. 7 Illustration of the drift-diffusion dynamics for the 
two-site problem. The arrows indicate the direction and mag- 
nitude of the drift velocity v = (v(r\), vfr?)) in the continuum 
equation 1221 . while the shading corresponds to the fitness 
function (dark means high fitness; here we used u> — 2 for the 
purpose of illustration, and a = 1) 



tion @, 




—n(ri,r 2 ,t) = cp(n, r 2 ) n{n, r 2 , t) + 



+ 



d 

dri 
d 

dr 2 



D(ri)——n(ri : r 2 ,t) - v{rt)n(ri, r 2 , t) 
ori 

d 

D(r 2 )—n(ri,r 2 ,t) ~ v(r 2 )n(ri,r 2 ,t) 



,(22) 



where D(r) and v(r) are still given by Eqs. I jlOjl and 
(|ll|l and n(ri, r 2l i), Xp{r\, r 2 ) are the continuum general- 
izations of N(ri,r 2 ,t) and 4>(ri,r 2 ), respectively. Fig.0 
illustrates the two-dimensional (biased) diffusion dynam- 
ics that emerges from Eq. (|22|) . The fitness function 
has a high plateau or 'mesa' at n, r 2 < vq, and two 
strips of lower fitness along the r\ and r 2 axis. Hence 
selection tries to keep n, r 2 < tq- Mutation, on the 
other hand, drives the distribution towards the average 
number of mismatches in a random binding sequence, 
n = r 2 = (A ~ 1)L/A, as indicated by the arrows in 
Fig. [7| We are interested in the stationary distribu- 
tion iVo(ri,r2) that arises as a balance between selec- 
tion and mutation. Below we characterize the depen- 
dence of No(ri,r 2 ) on the effective selection pressure 
a = 2a/ v and the synergism factor ui numerically by 
iterating Eq. (|2T|l . However, we first anticipate the qual- 
itative behavior of No(r±,r 2 ) using the understanding of 
the single site problem gained in the last section. 

Let us neglect a possible asymmetry between the sites 
for the moment, i.e. we set a = 1. It is clear that if 5 is 
below a certain threshold value, no motif will be main- 
tained, i.e. the peak of the stationary distribution will be 
close to ri = r 2 = (A — l)L/A. On the other extreme, 
when a is very large, the distribution will certainly be 
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FIG. 8 Qualitative behavior of the stationary two-site mis- 
match distribution when the effective selection pressure a is 
increased. The circles indicate the position of the peak of 
the distribution, while the number inside the circle signifies 
the number of conserved motifs in the respective state. The 
dashed arrow indicates the direct pathway from the 0-motif 
state to the 2-motif state, while the solid arrows indicate the 
pathway through the intermediate 1-motif state (see discus- 
sion in the text). 



localized on the high htness mesa, corresponding to two 
conserved binding motifs. By analogy with the single site 
case, we would expect the distribution to be maximally 
fuzzy in this regime, and hence the peak of the stationary 
distribution to be close to ri = = tq. What happens 
when a takes on intermediate values? Upon increasing 
5, the peak of the stationary distribution may either pass 
directly from r\ = T2 = {A — l)L/A to T\ = T2 = Tq or 
go through a state with only one conserved motif (see 
Fig. [SJ. Intuitively, which of these "pathways" is taken, 
should depend on the value of to: when u> is small, the 
selective advantage of two conserved motifs over one is 
small and therefore a much higher selection pressure is 
needed to stabilize two motifs against mutations than 
just one, i.e. upon increasing the selection pressure the 
system passes from to 1 to 2 motifs. Conversely, when 
uj is very large, two motifs are actually stabilized at lower 
selection pressures than a single motif would be, and 
hence the system passes directly from to 2 motifs. We 
can estimate the value uj c at which the system switches 
between the two pathways: when (u> — 1) <C 1, the 1- 
motif phase exists in an intermediate range of a's, i.e. 
a c \ < a < a C 2, where the lower critical value for the 
transition from to 1 motif is approximately the same 
as in the single-site problem, i.e. a c i ~ 1, and the up- 
per critical value is a C 2 ^ (lo — l)^ 1 , since the transition 
from 1 to 2 motifs may be regarded as another single- 
site problem with a replaced by (lj — l)a. The system 
switches between the two pathways when a c \ = a C 2, and 
hence lu c rj 2. 

In the 1-motif phase, the selection of either motif, at 
site 1 or site 2, is equiprobable as long as a — 1. Corre- 




FIG. 9 Stationary distribution ATofri., ra) in the 0-motif, 1- 
motif, and 2-motif phase (a = 0.2, 5, and 50, respectively, 
and uj = 1.1). 



spondingly, No(ri,r2) has two equal peaks as indicated 
in Fig. |SJ When a < 1 the peak associated with site 2 
will disappear, but otherwise the qualitative behavior of 
the model remains the same. For simplicity, we will keep 
a = 1 from here on. 

In order to examine the qualitative picture outlined 
above and to render it more quantitative, we now char- 
acterize the behavior of No(ri,r2) numerically using the 
parameters tailored to CRP, i.e. L = 10 and ro = 3. 
To determine -/Vo(ri,r2), we evolve an arbitrary initial 
distribution N(ri,r2,t — 0) using Eq. (|21|l until the sta- 
tionary state is reached. Fig. [5] displays three such sta- 
tionary distributions, one each in the 0-motif, 1-motif, 
and 2-motif phase (here, we used a = 0.2, 5, and 50, 
together with oj = 1.1). Besides justifying the schematic 
sketch in Fig. it shows that the distributions both in 
the 1- and 2-motif phase are peaked at the "edge" r = 3 
and are therefore maximally fuzzy as in the single site 
problem. 

Next, we focus on the transitions between the three 
phases. In Fig. 1101 the average total number of matches, 
i.e. 2L— (ri)— fa) (here (. . .} denotes an average over the 
stationary distribution), is plotted against a, again with 
u> = 1.1 (solid line). [Note that in this hgure the y-axis 
is reversed compared to Fig. 0]] We observe that the to- 
tal number of matches rises quite sharply around 5 = 1 
and a — 10. These positions agree with our estimates 
a c i ~ 1 and a C 2 ~ (w — 1) _1 based on the qualitative 
discussion above. (Note that since we work with a small 
'system size' of L — 10, the transitions, which are sharp 
in the limit L — > 00, appear only as smooth crossovers.) 
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FIG. 10 Total number of matches in the two sites as a func- 
tion of a (solid line) , together with the number of matches in 
each site at the peak of the stationary distribution (first mo- 
tif: dots, second motif: diamonds). [Note that in this figure 
the y-axis is reversed compared to Fig. [I]] 
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FIG. 11 Phase diagram in (a, o;)-space. The boundaries are 
given by a c i (circles) and a C 2 (triangles) as a function of lo. 



In order to show that the rises are indeed caused by the 
transitions from the 0-motif to the 1-motif phase, and 
from the 1-motif to the 2-motif phase, respectively, we 
also plotted the number of matches in each site at the 
peak (at one of the two peaks in the 1-motif phase) of 
the stationary distribution in Fig. I1UI (circles and trian- 
gles) . This illustrates the typical behavior of the individ- 
ual sites: the first site jumps from 2 ~ 3 matches to 7 
matches at 5 « 1 and the second site does the same at 
a« 10. Hence, the motifs clearly appear in a step-like 
fashion as the selection pressure is increased. 

In evolution experiments with RNA viruses, this 
twofold transition should be directly observable (if a 
suitable operon can be identified where the fitness func- 
tion 119|) is applicable), since according to our estimates 
above, 5 for these systems can be tuned over a range of 
1 ~ 100 by varying the fractional exposure to different 
environmental conditions. On the basis of our model, 
one would expect, for instance, that one of the sites in a 
doublet disappears in the course of an evolution experi- 
ment, if the selection pressure is sufficiently lowered by 
reducing the exposure to the environment where binding 
is beneficial. When the exposure is reduced to zero, both 
regulatory sites and the gene coding for the transcription 
factor (if not required for other mechanisms) will be lost. 

To complete our characterization of the model behav- 
ior, we map out the entire phase diagram in the (5, lo) 
parameter-space. The result is shown in Fig. ^] where 
a c i and a C 2 are plotted as a function of lo. Since L = 10 
in the present case and the phase boundary is well-defined 
only in the limit L — » oo, the curves a c i(u>) and 5 C 2(oj) 
are really only crossover lines whose precise location is 
slightly dependent on their definition (in the figure they 
are represented by dashed lines in order to indicate this 
fact). [Here, we defined 5 c i and 5 C 2 as the value of a 



where the peak of the stationary distribution first reaches 
7 matches in the respective site.] We see that the phase 
boundaries join at lo w 2 as expected from our estimate 
above. When lo — > 1 from above, the upper bound- 
ary, a C 2, diverges as (lo — 1) . This implies that at 
to = 1, the two- motif phase is unreac hable, regardless 
of the value of a. Finally, we note that iHermisson et alJ 
(2001) observe a somewhat similar two-fold transition be- 
tween three phases in a situation where two different mu- 
tation processes with distinct mutation rates are taken 
into account. 

Let us now return to the scenario outlined at the be- 
ginning of this section, and discuss whether we can inter- 
pret the regulatory regions with multiplets as being under 
higher selective pressure for factor binding than the ones 
with single binding sites. Our study of the two-site prob- 
lem would justify this conclusion, if (a) the values of lo 
and a were very similar for all regulatory regions, and (b) 
the effective selection pressure were typically on the same 
order of magnitude as a C 2- Then, we could tentatively as- 
sociate singlets with an a below S C 2, and multiplets with 
an a above that threshold. However, both conditions are 
not likely to be fulfilled generically in bacteria. First, the 
values of lo and a not only depend on the sequence of the 
promoter and the distance of the binding sites from the 
promoter, but also on the level of gene expression that 
is beneficial for the cell. For example, genes that code 
for proteins which are not needed in large amounts, even 
in the environmental condition where their expression is 
favored, do not require a large activation, and hence the 
effect of a second binding site could even be detrimental, 
i.e. lo < 1. Second, a should typically be on the order of 
10 4 (see above), and hence, as long as lo is only slightly 
larger than one (lo > 1.0001), our model would always 
predict multiple binding sites for bacterial transcription 
factors. Therefore, within our model, whether one or two 
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motifs are maintained depends almost exclusively on the 
value of u>, i.e. to < 1 leads to one motif and u> > 1 to 
two motifs. However, there may be cases where maintain- 
ing subtle differences in the temporal ordering of turning 
on/off different operons would give rise to a very small 
fitness advantage, e.g. flagella assembly and SOS-rcspnse 
in E. coli (see recent results by U. Alon, submitted for 
publication). In such cases, the system may respond by 
keeping one or two motifs according to the theory we 
presented. And of course there is also the situation of 
RNA viruses described above where the twofold transi- 
tion as depicted in Fig. ^| could in principle be directly 
observed. 



V. DISCUSSION AND CONCLUSIONS 

The fuzziness of regulatory binding motifs is a widely 
observed phenomenon. The present investigation has 
shown that the entropic advantage of introducing mis- 
matches from the best binding sequence is sufficient to 
produce motifs that are maximally fuzzy while still re- 
taining the capability of factor binding. Nevertheless, we 
cannot exclude that the fuzziness actually bears a selec- 
tive advantage (in the language of population genetics, 
this would correspond to a stabilizing selective pressure). 
The alternative scenario given for the fuzziness of the 
CRP sites is an explicit example of the latter. It would 
be very interesting to address this question experimen- 
tally by directly measuring the fitness of a bacterium or 
virus as a function of the sequence of its binding sites: 
starting with a wild-type binding site that has several 
mismatches, what is the effect on the fitness, when the 
number of mismatches is reduced by site-directed mu- 
tagenesis? Does the fitness remain unchanged or is it 
reduced? Besides answering the question raised above, 
experiments of that type could also lead to important 
conclusions on the evolution of genetic regulation. 

Another important result of our study is the phase 
transition associated with the maintenance of motifs. 
Our general conclusion is that the selection pressure on 
a single binding motif needs to surpass a threshold value 
of approximately vqL/2 to guarantee maintenance, while 
the threshold for a second site (for the same factor and in 
the same regulatory region) is larger by a factor (u>— 1) , 
where uj is given by the ratio of the fitness of the organ- 
ism with two sites over the fitness of the organism with 
one site. As pointed out above, this prediction could be 
tested experimentally by evolving RNA viruses in a fluc- 
tuating environment and varying the fractional exposure 
to the environment where factor binding is beneficial. In 
this case there would be no need to do site-directed mu- 
tagenesis, since the transition could be observed directly 
by sequencing. 

Our model makes quantitative predictions on the 
statistics of ■polymorphisms in binding sites. To test 
these, it will not suffice to sequence a particular bind- 
ing site in many different isolates from a single, large 



(Ni>o S> 1) laboratory population, since this population 
originates from a small, genetically homogeneous popu- 
lation and it takes a time on the order of v^ 1 to equi- 
librate the distribution of mismatches in a binding site. 
Instead, sequencing the same binding sites in several dif- 
ferent strains should yield the desired data. Besides al- 
lowing a comparison to our model, detailed information 
on polymorphisms in binding sequences would also make 
it possible to address a number of interesting questions, 
e.g. how does the selection pressure on binding sites com- 
pare with the selection pressure on coding sequences 10 ? 
Or, can one identify compensating mutations between 
promoter and binding site sequences, e.g. could a mu- 
tation that weakens the promoter be compensated by a 
mutation that increases the affinity of an activator to its 
operator site? 

We conclude that the evolution of transcription factor 
binding sites is a problem that is well suited to establish- 
ing a link between the detailed molecular biophysics of a 
system and its evolutionary dynamics. The theory pre- 
sented in the present article is meant as a first step, with 
the hope of stimulating future experiments in this direc- 
tion. There are many directions for the improvement 
of the model and the analysis. One important ques- 
tion is the validity of the mean-field analysis described 
here. Is the finite population size effect important here 
and how would it change the motif statistics within our 
model? One can also investigate more elaborate models 
including, for instance, the effect of time-dependent en- 
vironment, the coupled evolution of the polymerase and 
transcription factor the binding sites, and the interaction 
among different factors. 
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Note added: Upon completion of the pres ent work, 
we learned of the work of lSenguota et al.1 1)2002|) who also 
examined the protein-DNA interaction from an evolu- 
tionary perspective. While very similar mean-field evolu- 
tion equations are analyzed in both works, the emphasis 



For example, we would expect that the selection pressure on the 
coding sequence of the binding region in transcription factors 
such as crp or lexA, which have many binding sites distributed 
over the whole genome, is much higher than on individual op- 
erator sites, since a change in the sequence of an operator site 
affects only the regulation of that particular site, while a change 
in the aminoacid sequence of the binding region of the protein 
affects the regulation of many genes. 
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are quite different, with ISengupta et "all (|2002t) arguing 
for a "robustness" criterion based on minimizing muta- 
tional load. 
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